By iandix, May 2018
In this tutorial we’ll build a Neural Network architecture known as Multi-Layer Perceptron to predict the species of a flower based on some of its physical attributes. We’ll do this with the R version of the very popular Keras API.
Keras is a high-level neural networks API developed with a focus on enabling fast experimentation. Being able to go from idea to result with the least possible delay is key to doing good research.
Keras was originally developed as a Python library and only recently was ported to R, with roughly the same functionality as the original version.
The Keras package already comes with the following built-in datasets:
CIFAR10 & CIFAR100: The CIFAR-10 and CIFAR-100 are labeled subsets of the 80 million tiny images dataset commonly used in image classification tasks.
IMDB: Dataset of 25,000 movies reviews from IMDB, labeled by sentiment (positive/negative).
Reuters: Dataset of 11,228 newswires from Reuters, labeled over 46 topics. As with [dataset_imdb()], each wire is encoded as a sequence of word indexes (same conventions).
MNIST: Dataset of 60,000 28x28 grayscale images of the 10 handwritten digits, along with a test set of 10,000 images.
Boston Housing: Boston housing price regression dataset, taken from the StatLib library maintained at Carnegie Mellon University.
Although these datasets are quite helpful for learners and well known in the Data Science community, for this specific tutorial we’ll choose another one which is really simple and easy to work with: the Iris Dataset.
This is perhaps the best-known database in the pattern recognition literature. A classic in the field, it was compiled by R.A. Fisher in 1936 and is still referenced frequently to this day. The dataset contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other 2; the latter are NOT linearly separable from each other. The predicted attribute is the class of iris plant.
Most of the dataset attributes are measurements of the Iris flowers' sepals and petals:
Sepals are the outer parts of the flower (often green and leaf-like) that enclose a developing bud. In most cases sepals remain green and make up the outer whorl of the flower; their basic function is to protect the flower's reproductive organs.
Petals are the parts of a flower that are often conspicuously colored. In most cases petals lie just inside the sepals and function as attractants: their varied colors draw bees, birds, and other pollinators so that pollination can occur.
Class identifies the flower species and can assume three values: Iris Setosa, Iris Versicolour, and Iris Virginica.
In Iris flowers both the sepals and petals are colorful and not easily distinguishable, as one can see in the following picture:
We all know that nowadays Machine Learning is the icing on the cake. After all, as evangelized by the legendary Andrew Ng, Artificial Intelligence is the New Electricity. To get there, though, it's necessary to go through a laborious process. In R, the steps of this process are roughly: load the data, explore and preprocess it, split it into training and test sets, build and compile a model, train it, evaluate it, and finally use it for predictions.
So let’s load the Iris data from the University of California, Irvine archive. To do this we’ll read the dataset CSV file directly from a URL and then run some basic instructions to verify that the import was successful:
# Load the Iris Dataset
iris <- read.csv(url("http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"), header = FALSE)
# Verify the first few items
head(iris)
# Inspect data structure
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ V1: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ V2: num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ V3: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ V4: num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ V5: Factor w/ 3 levels "Iris-setosa",..: 1 1 1 1 1 1 1 1 1 1 ...
# Check Dimensions
dim(iris)
## [1] 150 5
As shown by the ‘str’ function, the imported CSV data has multiple types: columns V1-V4 are numeric (‘num’) while column V5 is a ‘Factor’. This tabular structure is called a Data Frame.
To allow a more didactic exploration, let’s add names to the ‘data.frame’ columns:
# Adding names to columns
names(iris) <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species")
One interesting exploration that can be done is to verify the correlations among attributes. This can be done by means of a Correlogram which is a graphical representation of a correlation matrix. It is very useful to highlight the most correlated variables in a data table. In this plot, correlation coefficients are colored according to their values.
library(corrplot)
## corrplot 0.84 loaded
# Computing the correlation matrix
M <- cor(iris[,1:4])
# Plotting the correlation matrix
corrplot(M, method = "circle")
corrplot(M, method = "number")
We can see visually or numerically that some attributes have a positive correlation (shades of blue) whereas other ones have a negative correlation (shades of red).
To further investigate, lets plot an exemplary positive and negative correlation by using a scatter plot:
library(ggvis)
# Show positive correlation of 0.96 between petal length and width
iris %>% ggvis(~Petal.Length, ~Petal.Width, fill = ~Species) %>%
layer_points()
# Show negative correlation of -0.11 between sepal length and width
iris %>% ggvis(~Sepal.Length, ~Sepal.Width, fill = ~Species) %>%
layer_points()
Working with the data in a Data Frame format is quite convenient for exploratory purposes. However, to process the data with the Keras library we’ll need to convert it to a purely numerical representation:
# Converting from Factor labels to numbers
iris[,5] <- as.numeric(iris[,5]) - 1
# Turning Iris into a matrix (matrices are purely numeric)
iris <- as.matrix(iris)
# Removing column labels by setting Iris 'dimnames' to 'NULL'
dimnames(iris) <- NULL
The ‘Species’ attribute (column 5) is of type ‘Factor’. Factors are categorical variables that are very useful in summary statistics, plots, and regressions. Here we’re converting “setosa”, “versicolor”, and “virginica” to the numeric values 0, 1, and 2.
Next, we can verify the effect of normalizing the data. In general terms, data normalization rescales one or more attributes into the range 0 to 1. Note, however, that the ‘normalize’ function from Keras used below actually performs an L2 normalization along each row; for non-negative data like ours this also yields values in [0,1], but it is not a min-max rescaling.
library(keras)
# Summarizing the data before normalization
summary(iris)
## V1 V2 V3 V4
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.054 Mean :3.759 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## V5
## Min. :0
## 1st Qu.:0
## Median :1
## Mean :1
## 3rd Qu.:2
## Max. :2
hist(iris, main = "Before Normalization")
# Normalizing the data
iris_norm <- normalize(iris[,1:4], -1, 2)
# Summarizing the new normalized dataset
summary(iris_norm)
## V1 V2 V3 V4
## Min. :0.6539 Min. :0.2384 Min. :0.1678 Min. :0.01473
## 1st Qu.:0.7153 1st Qu.:0.3267 1st Qu.:0.2509 1st Qu.:0.04873
## Median :0.7549 Median :0.3544 Median :0.5364 Median :0.16415
## Mean :0.7516 Mean :0.4048 Mean :0.4550 Mean :0.14096
## 3rd Qu.:0.7884 3rd Qu.:0.5252 3rd Qu.:0.5800 3rd Qu.:0.19753
## Max. :0.8609 Max. :0.6071 Max. :0.6370 Max. :0.28042
hist(iris_norm, main = "After Normalization")
As we can see in the histogram titled ‘After Normalization’, all the data is now in the interval [0,1].
Another common operation is to standardize our numeric features. Data Standardization is the process of rescaling one or more attributes so that they have a mean value of 0 and a standard deviation of 1. Standardization assumes that your data has a Gaussian (bell curve) distribution. This does not strictly have to be true, but the technique is more effective if your attribute distribution is Gaussian.
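Although we won’t standardize the Iris data in this tutorial, a minimal sketch using base R’s ‘scale’ function shows what the operation looks like:

```r
# Standardize the four numeric columns: subtract each column's mean
# and divide by its standard deviation
iris_std <- scale(iris[, 1:4], center = TRUE, scale = TRUE)

# Each column should now have mean approximately 0 and sd 1
colMeans(iris_std)
apply(iris_std, 2, sd)
```
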
We won’t standardize our Iris data. In fact, since our data spans a very restricted range, there wasn’t even a strong reason to normalize it in the first place.
The usual motivations to normalize features/attributes are that attributes measured on very different scales can dominate distance-based computations and slow down gradient-based training, and that bounded inputs tend to make optimization better behaved.
Based on the previous conclusions, we’ll keep working with the original Iris data, without normalizing it.
It’s time to organize the data into Training and Test Sets. R has plenty of utility functions to help with this task:
# Create a vector of indices 1 and 2 with 2/3 of 1's and 1/3 of 2's
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.67, 0.33))
# Use indices to select data and compose training and test sets
iris.training.X <- iris[ind == 1, 1:4]
iris.test.X <- iris[ind == 2, 1:4]
# Also, split the labeled attribute
iris.training.y <- iris[ind == 1, 5]
iris.test.y <- iris[ind == 2, 5]
The ‘sample’ function takes the vector [1:2] as input, receives the number of items to draw, and randomly generates an index vector containing roughly 2/3 of 1’s and 1/3 of 2’s. Setting ‘replace = TRUE’ means that each draw is made from the whole set [1:2]; elements are not removed from the original vector after being sampled.
Then we use a vector expression to generate a logical vector to be used as an index vector:
head(ind)
## [1] 1 1 1 2 1 1
head(ind == 2)
## [1] FALSE FALSE FALSE TRUE FALSE FALSE
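Because ‘sample’ is random, each run produces a different split (and therefore slightly different numbers downstream). If you want the split to be reproducible across runs, fix the random seed before sampling; a small sketch (the seed value 123 is arbitrary):

```r
# Fixing the seed makes the random train/test split reproducible
set.seed(123)
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.67, 0.33))
```
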
Now that we have our training and test sets organized let’s enter the Machine Learning party.
Our motivating question was: How can we predict the Iris flower species (Versicolor, Setosa or Virginica) based on its sepal and petal attributes?
In order to be able to make such predictions we’ll now build the Machine Learning Model. One of the central Keras concepts is that of a Sequential Model: a linear stack of layers. To create a sequential model and add some layers:
library(keras)
# Initializing the sequential model
model <- keras_model_sequential()
# Adding layers to the model
model %>%
layer_dense(units = 8, activation = 'relu', input_shape = c(4)) %>%
layer_dense(units = 3, activation = 'softmax')
Note: the pipe operator in R (%>%) is quite common these days and avoids a lot of nested parentheses when chaining function calls. For a more detailed discussion of the pipe operator, read the article Pipes in R: Tutorial for Beginners.
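To make the equivalence concrete, here is a toy computation written both ways; the pipe (re-exported by keras, originally from the magrittr package) lets the chain read left to right instead of inside-out:

```r
library(magrittr)

# Nested form: calls read inside-out
sqrt(sum(c(1, 4, 9)))

# Piped form: the same computation reads left to right
c(1, 4, 9) %>% sum() %>% sqrt()
```
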
In the Iris case, we are constructing a Multi-Layer Perceptron, which is a kind of Feed-Forward Neural Network. It will work as a multi-class classifier.
Our NN will have an input layer of 4 features, a hidden dense layer of 8 units with ReLU activation, and an output dense layer of 3 units (one per species) with softmax activation.
Having created the Sequential Model, we can at any time get information of it by using some utility functions:
# Printing summary info for the model
summary(model)
## ___________________________________________________________________________
## Layer (type) Output Shape Param #
## ===========================================================================
## dense_1 (Dense) (None, 8) 40
## ___________________________________________________________________________
## dense_2 (Dense) (None, 3) 27
## ===========================================================================
## Total params: 67
## Trainable params: 67
## Non-trainable params: 0
## ___________________________________________________________________________
# Getting model configuration
get_config(model)
## [{'class_name': 'Dense', 'config': {'kernel_initializer': {'class_name': 'VarianceScaling', 'config': {'distribution': 'uniform', 'scale': 1.0, 'seed': None, 'mode': 'fan_avg'}}, 'name': 'dense_1', 'kernel_constraint': None, 'bias_regularizer': None, 'bias_constraint': None, 'dtype': 'float32', 'activation': 'relu', 'trainable': True, 'kernel_regularizer': None, 'bias_initializer': {'class_name': 'Zeros', 'config': {}}, 'units': 8, 'batch_input_shape': (None, 4), 'use_bias': True, 'activity_regularizer': None}}, {'class_name': 'Dense', 'config': {'kernel_initializer': {'class_name': 'VarianceScaling', 'config': {'distribution': 'uniform', 'scale': 1.0, 'seed': None, 'mode': 'fan_avg'}}, 'name': 'dense_2', 'kernel_constraint': None, 'bias_regularizer': None, 'bias_constraint': None, 'activation': 'softmax', 'trainable': True, 'kernel_regularizer': None, 'bias_initializer': {'class_name': 'Zeros', 'config': {}}, 'units': 3, 'use_bias': True, 'activity_regularizer': None}}]
# Getting layer configuration
get_layer(model, index = 1)
## <keras.layers.core.Dense>
# Listing model's layers
model$layers
## [[1]]
## <keras.layers.core.Dense>
##
## [[2]]
## <keras.layers.core.Dense>
# Listing input tensors
model$inputs
## [[1]]
## Tensor("dense_1_input:0", shape=(?, 4), dtype=float32)
# Listing the output tensors
model$outputs
## [[1]]
## Tensor("dense_2/Softmax:0", shape=(?, 3), dtype=float32)
The information produced by the ‘summary’ function shows that we have 2 dense layers with a total of 67 parameters. This is because the first layer has 4 inputs x 8 units + 8 biases = 40 parameters, and the second layer has 8 inputs x 3 units + 3 biases = 27 parameters.
An interesting point to note is that, when listing information about our input and output layers, the term Tensor appears. Tensors are generalizations of vectors and matrices to potentially higher dimensions. They are common vocabulary when working with the TensorFlow library (we shouldn’t forget that Keras is a high-level API that uses TensorFlow underneath as its backend computation engine).
Before training a model, we have to configure the learning process via the ‘compile’ function:
# Compiling the model
model %>% compile(
loss = 'categorical_crossentropy',
optimizer = 'adam',
metrics = 'accuracy'
)
The ‘compile’ function takes the loss function to minimize, the optimizer that drives the weight updates, and the list of metrics to report during training.
‘categorical_crossentropy’ and ‘accuracy’ are pretty obvious choices for our model. For the optimizer, in general terms, since our network is not very deep and the training process is quick enough, we could test different optimizers and verify the results. According to this Quora post, though, Adam is often a solid choice; for a deeper discussion you should definitely check Andrej Karpathy’s CS231n lecture notes, which are quite comprehensive regarding learning and hyperparameters.
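If you do want to experiment, swapping the optimizer is a one-line change at compile time. A sketch using plain SGD with an explicit learning rate (the value 0.01 is just an illustrative choice):

```r
# Same model, compiled with SGD instead of Adam; 'optimizer_sgd'
# is a Keras helper that accepts an explicit learning rate
model %>% compile(
  loss = 'categorical_crossentropy',
  optimizer = optimizer_sgd(lr = 0.01),
  metrics = 'accuracy'
)
```
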
Before finally training our Multi-Layer Perceptron, we still have one more thing to do: change the encoding of the output labels to the One-Hot Encoding format. This way, for every input example the respective output label becomes a vector with 3 elements, all zero except for a 1 in the position corresponding to the example’s class:
# One hot encoding the training target values
iris.training.y.onehot <- to_categorical(iris.training.y)
# One hot encoding the test target values
iris.test.y.onehot <- to_categorical(iris.test.y)
# Printing out exemplary test data after one hot
print(head(iris.test.y.onehot))
## [,1] [,2] [,3]
## [1,] 1 0 0
## [2,] 1 0 0
## [3,] 1 0 0
## [4,] 1 0 0
## [5,] 1 0 0
## [6,] 1 0 0
Done in a very easy way with the help of the Keras auxiliary function ‘to_categorical’! Let’s proceed with the training:
# Training the model
history <- model %>%
fit(
iris.training.X,
iris.training.y.onehot,
epochs = 200,
batch_size = 8,
validation_split = 0.2
)
plot(history)
# Evaluate on test data and labels
score <- model %>%
evaluate(iris.test.X, iris.test.y.onehot, batch_size = 128)
# Print the score
print(score)
## $loss
## [1] 0.3790988
##
## $acc
## [1] 0.7567568
As an exercise, if we had instead used only 100 epochs for training the network, we would have ended up with a higher loss and lower accuracy. The reality is that since our dataset is small, we need to go through many rounds of training.
Although the loss and accuracy scores are pretty intuitive, the plot of ‘history’ is not. Let’s plot them separately to see if we can get a cleaner view:
# Plotting model loss for training data
plot(history$metrics$loss, main="Model Loss", xlab = "epoch", ylab="loss", col="blue", type="l")
# Plotting model loss for validation data
lines(history$metrics$val_loss, col="green")
# Adding legend
legend("topright", c("train","validation"), col=c("blue", "green"), lty=c(1,1))
# Plotting accuracy for training data
plot(history$metrics$acc, main="Model Accuracy", xlab = "epoch", ylab="accuracy", col="blue", type="l")
# Plotting accuracy for validation data
lines(history$metrics$val_acc, col="green")
# Add Legend
legend("bottomright", c("train","validation"), col=c("blue", "green"), lty=c(1,1))
With the model trained and fitted to the data, we can now use our test set to predict labels (Iris flower class/species) over new inputs.
# Predicting classes for test data
classes <- model %>%
predict_classes(iris.test.X, batch_size = 128)
classes
## [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 2 2 2 2 1 2 2 2 1 1 1 1 1 1 2
## [36] 1 1
We can use Keras to calculate loss and accuracy:
# Evaluating over test data and labels
score <- model %>% evaluate(iris.test.X, iris.test.y.onehot, batch_size = 128)
# Printing the score
print(score)
## $loss
## [1] 0.3790988
##
## $acc
## [1] 0.7567568
A general way of evaluating model performance is by means of Confusion Matrices.
A Confusion Matrix, also known as an error matrix, is a specific table layout that allows visualization of the performance of an algorithm, typically a supervised learning one (in unsupervised learning it is usually called a matching matrix). Each row of the matrix represents the instances in a predicted class while each column represents the instances in an actual class (or vice versa). The name stems from the fact that it makes it easy to see if the system is confusing two classes (i.e. commonly mislabeling one as another).
In the case of a binary classifier for instance, with classes named Pos and Neg, we would have:
Some definitions: a True Positive (TP) is a Pos example correctly classified as Pos; a True Negative (TN) is a Neg example correctly classified as Neg; a False Positive (FP) is a Neg example wrongly classified as Pos; and a False Negative (FN) is a Pos example wrongly classified as Neg.
Another common representation of the Confusion Matrix is shown below:
where TP, TN, FP, and FN denote True Positives, True Negatives, False Positives, and False Negatives, respectively.
Note that in this view the actual values are shown as rows whereas the predicted ones are shown as columns.
In this new nomenclature the previous metrics are:
\(Accuracy=\frac{TP+TN}{\#Examples}\)
\(Precision=\frac{TP}{TP+FP}\)
\(Recall=\frac{TP}{TP+FN}\)
Intuitively, we can say that precision answers “of the examples predicted as Pos, how many really are Pos?”, while recall answers “of the examples that really are Pos, how many did we find?”.
Precision and recall avoid the problem with accuracy when dealing with skewed datasets. We shall note that there is a trade-off between precision and recall.
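A tiny numeric example shows why accuracy misleads on skewed data. Suppose 990 of 1000 examples are Neg and a classifier simply predicts Neg every time (these counts are made up for illustration):

```r
# Confusion counts for an always-Neg classifier on a 990/10 split
TP <- 0; FN <- 10   # all 10 actual Pos examples are missed
FP <- 0; TN <- 990  # all actual Neg examples are correct

accuracy <- (TP + TN) / (TP + TN + FP + FN)  # 0.99, looks great
recall   <- TP / (TP + FN)                   # 0, reveals the problem
accuracy
recall
```
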
OK, that’s a lot about Confusion Matrices for binary classification problems. For a multi-class classifier, the matrix simply grows to one row and one column per class (say A, B, C, ...), with \(E_{XY}\) denoting the count of class-X examples classified as class Y.
In these terms:
\(Accuracy=\frac{\#Correct\ Classifications}{\#Classifications}\)
\(Precision\ of\ A=\frac{TP}{TP+FP}=\frac{TP_A}{TP_A+E_{BA}+E_{CA}+E_{DA}+E_{EA}}\)
\(Recall\ of\ A=\frac{TP}{TP+FN}=\frac{TP_A}{TP_A+E_{AB}+E_{AC}+E_{AD}+E_{AE}}\)
In R we can build the Confusion Matrix for our multi-class problem using the base ‘table’ function:
# Printing the Confusion Matrix
cm = table(iris.test.y, classes)
cm
## classes
## iris.test.y 0 1 2
## 0 13 0 0
## 1 0 7 0
## 2 0 9 8
# Accuracy
mean(iris.test.y == classes)
## [1] 0.7567568
# Precision for class 0
cm[1,1]/(cm[1,1] + cm[2,1] + cm[3,1])
## [1] 1
# Recall for class 0
cm[1,1]/(cm[1,1] + cm[1,2] + cm[1,3])
## [1] 1
# Precision for class 1
cm[2,2]/(cm[2,2] + cm[1,2] + cm[3,2])
## [1] 0.4375
# Recall for class 1
cm[2,2]/(cm[2,2] + cm[2,1] + cm[2,3])
## [1] 1
# Precision for class 2
cm[3,3]/(cm[3,3] + cm[1,3] + cm[2,3])
## [1] 1
# Recall for class 2
cm[3,3]/(cm[3,3] + cm[3,1] + cm[3,2])
## [1] 0.4705882
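The per-class calculations above can be wrapped in a small helper. A sketch, assuming (as in our ‘table’ call) that rows are actual classes and columns are predicted classes:

```r
# Per-class precision and recall from a confusion matrix where
# rows = actual classes and columns = predicted classes
class_metrics <- function(cm) {
  tp <- diag(cm)
  data.frame(
    class     = rownames(cm),
    precision = tp / colSums(cm),  # TP over all predicted as this class
    recall    = tp / rowSums(cm)   # TP over all actually in this class
  )
}

class_metrics(cm)
```
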
Even though our Neural Network is capable of learning its internal parameters, there are external ones known as Hyperparameters.
In machine learning, a hyperparameter is a parameter whose value is set before the learning process begins. By contrast, the values of other parameters are derived via training.
Sometimes it can be difficult to choose a correct architecture for Neural Networks. Usually, this process requires a lot of experience because networks include many parameters. Some of the most important ones that we can tune are the number of hidden layers, the number of units per layer, the activation functions, the learning rate, the batch size, and the number of training epochs.
Even though this list is not even close to being complete, it’s still impressive how many parameters influence the network’s accuracy.
Let’s make some experiments.
# Initialize the sequential model
model2 <- keras_model_sequential()
# Add layers to model
model2 %>%
layer_dense(units = 8, activation = 'relu', input_shape = c(4)) %>%
layer_dense(units = 5, activation = 'relu') %>%
layer_dense(units = 3, activation = 'softmax')
# Compile the model
model2 %>%
compile(
loss = 'categorical_crossentropy',
optimizer = 'adam',
metrics = 'accuracy'
)
# Fit the model to the data
history <- model2 %>%
fit(
iris.training.X,
iris.training.y.onehot,
epochs = 200,
batch_size = 8,
validation_split = 0.2
)
plot(history)
# Evaluate the model
score <- model2 %>%
evaluate(
iris.test.X,
iris.test.y.onehot,
batch_size = 128
)
# Print the score
print(score)
## $loss
## [1] 0.1903814
##
## $acc
## [1] 0.9189189
# Saving the whole model (architecture + weights) to disk
save_model_hdf5(model2, "mlp-iris.h5")
#model2 <- load_model_hdf5("mlp-iris.h5")
# Weights-only save/load requires the model as the first argument
#save_model_weights_hdf5(model2, "mlp-iris.h5")
#model2 %>% load_model_weights_hdf5("mlp-iris.h5")
#json_string <- model_to_json(model2)
#model2 <- model_from_json(json_string)
#yaml_string <- model_to_yaml(model2)
#model2 <- model_from_yaml(yaml_string)
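A quick way to check that a saved model round-trips correctly is to reload it and compare evaluations (a sketch, reusing the file name saved above):

```r
# Reload the model saved above and confirm it scores the same
restored <- load_model_hdf5("mlp-iris.h5")
score2 <- restored %>%
  evaluate(iris.test.X, iris.test.y.onehot, batch_size = 128)
print(score2)  # should match the score obtained from model2
```
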